By Amy Chen, Jeffrey Liu, Claudia Ye
This is an analysis of the data reported by the Department of Housing Preservation and Development (HPD). HPD issues violations to rental dwelling units that violate either the Housing Maintenance Code or the New York State Multiple Dwelling Law.
Tenants can consult their landlord directly or file an official complaint if they discover an issue in their apartment. Complaints are directed to HPD, which in turn contacts the building's managing agent to inform them of the complaint. HPD then follows up to confirm that the issue has been resolved and, if so, closes the complaint. If the issue remains unresolved, HPD sends a code inspector to check for housing or safety violations. Violations are categorized as class A, B, C, or I, with class A being the least severe and class C the most severe, such as problems with heating and hot water (class I cases are unique and indicate a serious hazard). Violations persist on file, and complaint cases remain open, until HPD can confirm that the owner has sufficiently corrected the condition.
Importing Packages
import pandas as pd
import geopandas as gpd
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])
Here is a look at the housing violation data. At the time the file was downloaded, it contained 4,955,054 housing violations, and the dataset is updated continuously. We will take a sample of that dataset and work with the sample.
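# Full dataset (large); uncomment to rebuild the sample from scratch
#housing = pd.read_csv('Housing_Maintenance_Code_Violations.csv')
housing = pd.read_csv('sample_housing.csv')
# Note: this overwrites sample_housing.csv with 90% of its rows on every run
sample = housing.sample(frac=0.9)
# index=False avoids writing the row index, which otherwise reloads as 'Unnamed: 0' columns
sample.to_csv('sample_housing.csv', index=False)
#housing = housing.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Unnamed: 0.1.1.1','Unnamed: 0.1.1.2'], axis=1)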
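The sampling step can be made repeatable and free of stray index columns. The following is a minimal sketch on made-up data, not the notebook's actual file: the `full` frame, the 10% fraction, and `random_state=42` are all assumptions chosen for illustration, and an in-memory buffer stands in for the CSV on disk.

```python
import io
import pandas as pd

# Toy stand-in for the full violations file (values are made up)
full = pd.DataFrame({
    "ViolationID": range(1, 101),
    "Class": ["A", "B", "C", "I"] * 25,
})

# A fixed random_state makes the sample repeatable across runs
sample = full.sample(frac=0.1, random_state=42)

# index=False keeps the old row index out of the CSV,
# so no 'Unnamed: 0' column appears when the file is read back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
print(len(reloaded))
```

Writing the sample to a differently named file than the one being read would also keep repeated runs from shrinking the sample each time.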
Our sample consists of 157,434 violations (approximately 3.2% of the complete dataset).
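As a quick arithmetic check, the sample share can be recomputed from the two counts quoted in the text. This is only a sketch; the variable names are illustrative.

```python
# Figures quoted in the text
total_violations = 4_955_054   # rows in the full dataset at download time
sample_violations = 157_434    # rows in the working sample

# Fraction of the full dataset covered by the sample
sample_share = sample_violations / total_violations
print(f"Sample share: {sample_share:.1%}")
```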
len(housing)
housing.columns
housing
Several date fields are used in this dataset. We will convert the inspection date and approval date from strings to datetimes. With the converted values, we create a new column called "Time_Until_Approval" to see how long HPD takes to approve a violation. Within this dataset, 65% of the violations are still open.
To find the percentage of violations that are still open, we group by the "ViolationStatus" column and count the open and closed violations. Then we divide the number of open violations by the total.
# Count violations per status
open_close = housing.groupby('ViolationStatus').count()
# Share of all violations that are still open (row 1 is assumed to be the open group)
open_close['ViolationID'].iloc[1] / open_close['ViolationID'].sum()
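The difference between a share of the total and an open-to-closed ratio is easy to see on a small example. This sketch uses made-up rows, and the status labels "Open" and "Close" are assumptions for illustration rather than values confirmed from the dataset.

```python
import pandas as pd

# Toy data; labels and counts are made up
toy = pd.DataFrame({
    "ViolationID": [1, 2, 3, 4, 5],
    "ViolationStatus": ["Open", "Close", "Close", "Open", "Close"],
})

# normalize=True returns each status as a fraction of all rows
status_share = toy["ViolationStatus"].value_counts(normalize=True)

open_share = status_share["Open"]                               # open / (open + closed)
open_to_closed = status_share["Open"] / status_share["Close"]   # a ratio, not a share

print(open_share, open_to_closed)
```

Looking the status up by label instead of by position also avoids relying on the sort order of the groups.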
Here we convert the inspection and approval dates into datetimes.
housing['InspectionDate'] = pd.to_datetime(housing['InspectionDate'], format="%m/%d/%Y", errors = 'coerce')
housing['ApprovedDate'] = pd.to_datetime(housing['ApprovedDate'], format="%m/%d/%Y", errors = 'coerce')
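# Absolute gap between inspection and approval
housing['Time_Until_Approval'] = (abs(housing['ApprovedDate'] - housing['InspectionDate']))
# Convert to whole days; violations with a missing date are set to 0
housing['Time_Until_Approval'] = (housing['Time_Until_Approval'] / np.timedelta64(1, 'D')).fillna(0).astype(int)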
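Filling missing durations with 0 treats unapproved violations as if they were approved on the day of inspection, which pulls averages toward zero. A possible alternative, sketched here on made-up dates, leaves those rows missing so they are skipped by `mean()`. The column name `Days_Until_Approval` is hypothetical.

```python
import pandas as pd

# Toy data: one valid approval date, one missing, one unparseable
toy = pd.DataFrame({
    "InspectionDate": ["01/05/2015", "03/10/2016", "07/01/2017"],
    "ApprovedDate": ["01/15/2015", None, "not a date"],
})

# errors="coerce" turns missing or malformed dates into NaT
for col in ["InspectionDate", "ApprovedDate"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

# .dt.days gives whole days; rows without an approval date stay missing (NaN)
toy["Days_Until_Approval"] = (toy["ApprovedDate"] - toy["InspectionDate"]).dt.days

# mean() skips NaN by default, so only approved rows contribute
print(toy["Days_Until_Approval"].mean())
```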
We will look at data from 1980 onward, since the file has few data points before then. Over the last 30 to 40 years, housing violations appear to have risen steeply. The graph below plots, for each year, the average number of violations per inspection date.
housing_past_1980 = housing[(housing.InspectionDate > '1980-01-01')]
matplotlib.pyplot.xlabel("Year")
housing_past_1980.InspectionDate.value_counts().sort_index().resample('AS').mean().plot()
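Taking the mean of daily counts within each year gives the average number of violations per inspection date, which is not the same as the yearly total. The sketch below, on made-up dates, computes both so the two can be compared; it groups by calendar year instead of resampling, which is a substitution made for brevity.

```python
import pandas as pd

# Toy inspection dates (made up)
dates = pd.to_datetime(pd.Series([
    "2015-01-05", "2015-01-05", "2015-06-20",
    "2016-02-01", "2016-02-01", "2016-02-01", "2016-09-09",
]))

# Number of violations on each inspection date
daily_counts = dates.value_counts().sort_index()
years = daily_counts.index.year

# Total violations per year
yearly_total = daily_counts.groupby(years).sum()
# Mean violations per inspection date in each year
yearly_mean_per_day = daily_counts.groupby(years).mean()

print(yearly_total.to_dict())
print(yearly_mean_per_day.to_dict())
```

Which of the two is plotted changes how the trend should be read: a rising mean per date can reflect busier inspection days rather than more violations overall.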
We will now look at how many unique violation types there are. The order number refers to the abstract description of the violation condition, which cites the specific section of the law that was violated. From the code below, we find 389 unique order numbers. That is a ratio of 4,955,054 individual violations to 389 violation codes. Order number 508 is the most common.
To count the unique order numbers, we first convert the "OrderNumber" column to strings.
housing['OrderNumber'] = housing['OrderNumber'].astype(str)
print(housing['OrderNumber'].nunique())
This graph displays the ten most common violations by their order numbers.
orderNumberViolation = housing[['ViolationID', 'OrderNumber']].groupby('OrderNumber').count().rename(columns={"ViolationID":"Count"})
orderNumberViolation.sort_values('Count', ascending=False).head(10).plot(kind='bar')
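The unique count and the ranking of order numbers can both be read from a single column. This is a small sketch on made-up order numbers; `value_counts` is used here as a shorter alternative to grouping and counting, not as the notebook's own method.

```python
import pandas as pd

# Toy order numbers (made up)
toy = pd.DataFrame({"OrderNumber": [508, 508, 508, 501, 501, 780]})

# Treat order numbers as labels, not quantities
orders = toy["OrderNumber"].astype(str)

n_codes = orders.nunique()                  # number of distinct order numbers
top_orders = orders.value_counts().head(2)  # most frequent first

print(n_codes)
print(top_orders)
```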
We will now look at the number of violations in each borough. Brooklyn has the most violations. The violations are also plotted on a map of NYC drawn from a GeoJSON file of neighborhood boundaries. Rows without a latitude and longitude are dropped since they were causing errors. The lighter the purple, the lower the concentration of violations.
housing = housing.dropna(subset=['Latitude', 'Longitude'])
matplotlib.pyplot.xlabel("Violation Count")
housing.Borough.value_counts().plot(kind='barh', figsize=(12,4))
!curl 'https://data.cityofnewyork.us/api/geospatial/cpf4-rkhq?method=export&format=GeoJSON' -o nyc-neighborhoods.geojson
df_nyc = gpd.GeoDataFrame.from_file('nyc-neighborhoods.geojson')
ny_map = df_nyc.plot(linewidth=0.5, color='White', edgecolor = 'Black', figsize = (20,15), alpha=0.5)
housing_loc = housing.sample(frac=0.7).plot.scatter(
x="Longitude",
y="Latitude",
figsize=(20,15),
s=0.3,
color='purple',
alpha=0.1,
ax=ny_map
)
The following graph takes a 10% sample of the data and plots a KDE map from it. The smaller sample makes the map somewhat less precise, but areas with high concentrations of violations still stand out.
ny_map = df_nyc.plot(linewidth=0.5, color='White',edgecolor = 'Black', figsize = (20,15), alpha=0.5)
sample = housing.sample(frac=0.1)
sns.kdeplot(
sample.Longitude, sample.Latitude,
gridsize=100,
cmap=plt.cm.Purples,
shade_lowest=False,
n_levels=30,
ax=ny_map
)
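One coarse way to sanity-check where a KDE map places its peak is a plain 2D histogram of the coordinates. This swaps in a different, simpler technique (grid counts, not kernel density), and every value below is synthetic: the cluster location, the bounding box, and the 20-by-20 grid are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic points: one dense cluster plus scattered background (made-up coordinates)
cluster = rng.normal(loc=[-73.95, 40.65], scale=0.01, size=(500, 2))
background = rng.uniform(low=[-74.25, 40.50], high=[-73.70, 40.90], size=(100, 2))
points = np.vstack([cluster, background])

# Count points in each cell of a 20 x 20 longitude/latitude grid
counts, lon_edges, lat_edges = np.histogram2d(
    points[:, 0], points[:, 1],
    bins=20,
    range=[[-74.25, -73.70], [40.50, 40.90]],
)

# Locate the densest cell and report its center
i, j = np.unravel_index(counts.argmax(), counts.shape)
peak_lon = (lon_edges[i] + lon_edges[i + 1]) / 2
peak_lat = (lat_edges[j] + lat_edges[j + 1]) / 2

print(peak_lon, peak_lat, counts.max())
```

If the densest grid cell and the darkest KDE contour disagree by a wide margin, the sample fraction or grid size is worth revisiting.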
The following graph shows how long HPD takes to approve a violation, depending on the class of the violation.
Class A violations are the least severe and class C violations are the most severe. Class I violations are considered hazardous.
We will disregard Staten Island and the Bronx here because they have fewer buildings with housing violations.
In the graph below, class A violations (least severe) take the longest to be approved in Queens, followed by Brooklyn, then Manhattan. Class C violations take the longest to be approved in Brooklyn, then Queens, then Manhattan.
In Manhattan, approval is generally much faster than in the other boroughs. Queens seems to have the slowest approval in every class except class C, where Brooklyn is the slowest.
To obtain this bar graph, we build a pivot table of the mean "Time_Until_Approval" for each borough and class from a random sample of 3,000 violations.
approval = housing[['Time_Until_Approval', 'Class']].set_index("Class")
approval_time= pd.pivot_table(
data = housing.sample(3000),
index='Borough',
columns='Class',
values='Time_Until_Approval'
).plot.barh()
# Time_Until_Approval is measured in days
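The pivot step, restricted to the three boroughs discussed in the text, can be sketched on a few made-up rows. The uppercase borough labels and the day values are assumptions for illustration; `aggfunc="mean"` is spelled out even though it is the default for `pivot_table`.

```python
import pandas as pd

# Toy data (made up): days until approval by borough and class
toy = pd.DataFrame({
    "Borough": ["BROOKLYN", "BROOKLYN", "QUEENS", "QUEENS", "MANHATTAN", "BRONX"],
    "Class": ["A", "C", "A", "A", "C", "A"],
    "Time_Until_Approval": [10, 4, 30, 20, 2, 7],
})

# Keep only the boroughs being compared
boroughs = ["BROOKLYN", "QUEENS", "MANHATTAN"]
subset = toy[toy["Borough"].isin(boroughs)]

# Mean days until approval for each borough (rows) and class (columns)
mean_days = pd.pivot_table(
    data=subset,
    index="Borough",
    columns="Class",
    values="Time_Until_Approval",
    aggfunc="mean",
)

print(mean_days)
```

Combinations with no rows, such as class A in Manhattan in this toy data, show up as NaN and would appear as missing bars in a plot.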
There are far fewer class I (hazardous) violations than violations in the other three classes, so let's look into them. From the KDE map and the value_counts output, Brooklyn clearly has the greatest number of class I violations. The top 9 streets with at least 1000 hazardous violations are found in Brooklyn's street of
To obtain this data, we first create a new dataframe which we named "classI", which contains the 'HouseNumber', 'StreetName', 'Borough', 'Latitude', and 'Longitude' columns for only the class I violations.
classI = housing.loc[housing['Class']=="I", ['HouseNumber', 'StreetName', 'Borough', 'Latitude', 'Longitude']]
classI
Then we found which borough had the most class I violations, which turned out to be Brooklyn.
classI['Borough'].value_counts()
Finally, we obtained a list containing the top ten streets for class I violations.
classI["StreetName"].value_counts().head(10)
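Street names can repeat across boroughs, so counting by name alone may merge different streets. The sketch below, on made-up rows, contrasts a count by street name with a count by borough and street name together; the street and borough values are illustrative only.

```python
import pandas as pd

# Toy class I rows (made up); "BROADWAY" appears in three boroughs
toy = pd.DataFrame({
    "Borough": ["BROOKLYN", "BROOKLYN", "BROOKLYN", "QUEENS", "QUEENS", "BRONX"],
    "StreetName": ["OCEAN AVENUE", "OCEAN AVENUE", "BROADWAY",
                   "BROADWAY", "BROADWAY", "BROADWAY"],
})

# Counting by name only merges streets that share a name across boroughs
by_name_only = toy["StreetName"].value_counts()

# Counting by (borough, street) keeps them separate
street_counts = (
    toy.groupby(["Borough", "StreetName"])
    .size()
    .sort_values(ascending=False)
)

print(by_name_only)
print(street_counts)
```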
This is a map of all the class I violations found in NYC.
ny_map = df_nyc.plot(linewidth=0.5, color='White', edgecolor = 'Black', figsize = (20,15), alpha=0.5)
sns.kdeplot(
classI.Longitude, classI.Latitude,
gridsize=100,
cmap=plt.cm.Purples,
shade=False,
shade_lowest=False,
n_levels=10,
ax=ny_map
)
complaints = pd.read_csv("https://data.cityofnewyork.us/api/views/uwyv-629c/rows.csv?accessType=DOWNLOAD")
len(complaints)
Take a 10% sample of the data to make running code easier.
complaints_sample = complaints.sample(frac=0.1)
len(complaints_sample)
complaints_sample.columns
complaints_sample
Here we compute the share of complaints that are still open.
complaints_status = complaints_sample.groupby('Status').count()
# Share of complaints that are open (row 1 is assumed to be the open group)
complaints_status['StatusID'].iloc[1] / complaints_status['StatusID'].sum()
complaints_sample['ReceivedDate'] = pd.to_datetime(complaints_sample['ReceivedDate'], format="%m/%d/%Y", errors = 'coerce')
complaints_sample['StatusDate'] = pd.to_datetime(complaints_sample['StatusDate'], format="%m/%d/%Y", errors = 'coerce')
complaints_sample['Time_Until_Approval'] = (abs(complaints_sample['StatusDate'] - complaints_sample['ReceivedDate']))
complaints_sample['Time_Until_Approval'] = (complaints_sample['Time_Until_Approval'] / np.timedelta64(1, 'D')).fillna(0).astype(int)
matplotlib.pyplot.xlabel("Year")
complaints_sample.StatusDate.value_counts().sort_index().resample('AS').mean().plot()
The sample did not contain enough data to form a continuous plot, so the plot below uses the full complaints dataset to show the number of complaints over time. Keep in mind that the complaint data does not reach back as far as the violation data.
complaints['ReceivedDate'] = pd.to_datetime(complaints['ReceivedDate'], format="%m/%d/%Y", errors = 'coerce')
complaints['StatusDate'] = pd.to_datetime(complaints['StatusDate'], format="%m/%d/%Y", errors = 'coerce')
#complaints_trend = complaints_sample[(complaints_sample.StatusDate > '2015-01-01')]
matplotlib.pyplot.xlabel("Year")
complaints.StatusDate.value_counts().sort_index().resample('AS').mean().plot()
Let's take a look at the number of complaints in the five boroughs. The distribution mirrors that of the violations above, with Brooklyn having the most complaints.
matplotlib.pyplot.xlabel("Complaints Count")
complaints_sample.Borough.value_counts().plot(kind='barh', figsize=(12,4))